GH-1179: Correct the size of var-width vector with >0 start offset during vector append #1180
jordepic wants to merge 1 commit
…width vectors with non-zero start offsets

VectorAppender computed the delta vector's data size as its last offset value, which is only correct when the offset buffer starts at zero. Vectors imported through the C data interface from sliced arrays can have a non-zero first offset; appending them copied the unreferenced data buffer prefix into the target, inflating it on every append until allocation eventually failed with OversizedAllocationException.

Compute the data size as the distance between the first and last offsets, copy from the first offset, and rebase appended offsets accordingly.

Fixes apache#1179.
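
The three steps in the commit message (size from the offset range, copy from the first offset, rebase the appended offsets) can be sketched on plain Java arrays. This is only an illustration under stated assumptions: `VarWidthAppendSketch` and its methods are hypothetical names, not arrow-java's `VectorAppender` or its buffer API, and the target's offsets are assumed to start at zero.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical plain-array stand-in for a variable-width vector (not the arrow-java API). */
final class VarWidthAppendSketch {
  final int[] offsets; // valueCount + 1 entries; offsets[0] may be non-zero for a sliced array
  final byte[] data;   // backing bytes; may contain a prefix that no offset references

  VarWidthAppendSketch(int[] offsets, byte[] data) {
    this.offsets = offsets;
    this.data = data;
  }

  int valueCount() {
    return offsets.length - 1;
  }

  String get(int i) {
    return new String(data, offsets[i], offsets[i + 1] - offsets[i], StandardCharsets.UTF_8);
  }

  /** Appends delta to target. Assumes target.offsets[0] == 0. */
  static VarWidthAppendSketch append(VarWidthAppendSketch target, VarWidthAppendSketch delta) {
    int targetCount = target.valueCount();
    int deltaCount = delta.valueCount();
    int targetEnd = target.offsets[targetCount];

    // Step 1: data size is the distance between the first and last offsets.
    int deltaStart = delta.offsets[0];
    int deltaSize = delta.offsets[deltaCount] - deltaStart;

    // Step 2: copy only the referenced bytes, starting at the first offset.
    byte[] data = Arrays.copyOf(target.data, targetEnd + deltaSize);
    System.arraycopy(delta.data, deltaStart, data, targetEnd, deltaSize);

    // Step 3: rebase each appended offset so it continues from the target's end.
    int[] offsets = Arrays.copyOf(target.offsets, targetCount + deltaCount + 1);
    for (int i = 1; i <= deltaCount; i++) {
      offsets[targetCount + i] = delta.offsets[i] - deltaStart + targetEnd;
    }
    return new VarWidthAppendSketch(offsets, data);
  }

  public static void main(String[] args) {
    VarWidthAppendSketch target =
        new VarWidthAppendSketch(new int[] {0, 2, 5}, "abcde".getBytes(StandardCharsets.UTF_8));
    // A "sliced" delta: the first four bytes are not referenced by any offset.
    VarWidthAppendSketch delta =
        new VarWidthAppendSketch(new int[] {4, 9, 10}, "xxxxhello!".getBytes(StandardCharsets.UTF_8));

    VarWidthAppendSketch out = append(target, delta);
    System.out.println(Arrays.toString(out.offsets)); // [0, 2, 5, 10, 11]
    System.out.println(out.get(2) + out.get(3));      // hello!
  }
}
```

In this sketch, using the delta's last offset alone as the size would make the copy 10 bytes long instead of 6 and would carry the unreferenced `xxxx` prefix into the target.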
Thank you for opening a pull request! Please label the PR with one or more of:
Also, add the 'breaking-change' label if appropriate. See CONTRIBUTING.md for details.
Could a maintainer please add the bug-fix label here?

Hey @lidavidm, @laurentgo, @wgtmac, @jbonofre - would one of you guys mind taking a look here? It's a pretty nefarious bug :)
jbonofre left a comment
LGTM.
I believe the same issue is present in ListVector and LargeListVector. It may be worth addressing in this PR or in a follow-up PR.

@jbonofre would you be able to label the PR so that those checks aren't failing? I don't have permissions. I can also follow up here with the list vector changes. Thank you for your review :)

@jordepic absolutely

I'm not sure the label was applied, if you tried to add it. I appreciate all of your help here.
What's Changed
Fix VectorAppender data size computation for variable-width vectors with non-zero start offsets
When appending a variable-width vector in DataFusion Comet, I was receiving exceptions
caused by allocating too much memory. Comet passes variable-width arrays back to Java
whose first offset buffer entry is greater than 0. Prior to this change, arrow-java
determined how many bytes to copy by looking only at the last entry in the offset buffer,
disregarding the first. If first = 100 and last = 200, Java copied 200 bytes instead
of 100. This change computes the size as last - first, copies starting at the first
offset, and rebases the appended offsets.
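
The first = 100, last = 200 case can be checked with a few lines of plain Java. This is a sketch of the arithmetic only; `DataSizeSketch` and both method names are hypothetical and are not taken from arrow-java.

```java
/** Hypothetical helper comparing the two ways of sizing a variable-width data buffer. */
final class DataSizeSketch {
  /** Old behaviour as described above: trusts the last offset alone. */
  static int sizeFromLastOffset(int[] offsets) {
    return offsets[offsets.length - 1];
  }

  /** Behaviour after the change: distance between the first and last offsets. */
  static int sizeFromOffsetRange(int[] offsets) {
    return offsets[offsets.length - 1] - offsets[0];
  }

  public static void main(String[] args) {
    // Two values of 50 bytes each, in a buffer whose first offset is 100.
    int[] offsets = {100, 150, 200};
    System.out.println("last offset only: " + sizeFromLastOffset(offsets));  // 200
    System.out.println("last - first:     " + sizeFromOffsetRange(offsets)); // 100
  }
}
```

The two results agree only when the first offset is zero, which is why the problem shows up with sliced arrays and not with vectors built from scratch.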
Closes #1179